Abstract


The ability to predict the intentions of people based solely on their visual actions is a skill
performed only by humans and animals. This requires segmenting items in the field of view,
tracking moving objects, identifying the importance of each object, determining the current role
of each important object individually and in collaboration with other objects, relating these
objects to a predefined scenario, assessing the selected scenario against the information
retrieved, and finally adjusting the scenario to better fit the data. This is all accomplished
with great accuracy within a few seconds.

The  intelligence  of  current  computer  algorithms  has  not  reached  this  level  of  complexity  with  the 
accuracy  and  time  constraints  that  humans  and  animals  have,  but  there  are  several  research  efforts 
that  are  working  towards  this  by  identifying  new  algorithms  for  solving  parts  of  this  problem. 

This  survey  paper  lists  several  of  these  efforts  that  rely  mainly  on  understanding  the  image 
processing  and  classification  of  a  limited  number  of  actions.  It  divides  the  activities  up  into 
several  groups  and  ends  with  a  discussion  of  future  needs. 


Keywords: visual human action classification, artificial intelligence, hidden
Markov model, grammars.


Introduction


Visual Human Action Recognition (VHAR) is an important and complex
artificial and computational intelligence process used to help solve many
understanding problems. If a system were developed to recognize violent
activity in a subway station [13], VHAR would be used to identify the actions of
each person in the field of view. These actions would be combined
through other algorithms to determine the hostility of people and possibly warn
operators that a violent action is occurring. Despite the need for VHAR systems,
research in this area has not reached the desired level of maturity, unlike
several other areas of artificial and computational intelligence, which are very
mature. Providing the framework for a robust VHAR algorithm is
difficult given all the different types of actions the
algorithm has to handle. People performing the same action move in a
variety of different ways, thus creating a non-deterministic class of actions; that is,
there is no specific or deterministic way people move for a particular action.
Even a single action performed repeatedly by the same person
varies in its movements.

This paper highlights several techniques used in VHAR research, including
non-traditional artificial intelligence techniques, visual languages, statistical
algorithms, and others. It is meant to be a survey of the current research
being performed in the area of VHAR. The goals of this paper are (1) to educate on
VHAR by showing a wide variety of innovative methods used to improve
specific areas of classification related to VHAR, and (2) to show how open the
field of research is to new techniques that will improve VHAR types of
classification systems. The papers referenced in this section represent a fairly
complete list of ideas used to advance the area of VHAR and related fields of
research.


Non-Traditional  Techniques 

A large amount of research in VHAR uses visual cues of human actions
without any traditional artificial or computational intelligence techniques [10, 12,
15, 17, 23, 25, 27, 31, 33, 34, 37, 43, 46, 47, 49, 50, 52, 55, 61, 62, 63]. These
algorithms rely on simplicity at the cost of fusing input data. They often use
atypical data inputs; that is, inputs that would not necessarily be used by
human observers. They rely almost exclusively on the pre-processing of the data
while using statistical or non-traditional artificial and computational intelligence
algorithms to determine the behavior. M. Cristani et al. [15] use both audio and
visual data to determine simple events in an office. First they remove foreground
objects and segment the images in the sequence. This output is coupled with the
audio data, and a threshold detection process is used to identify unusual events.
The fused event sequences are then put into an audio visual concurrence matrix
(AVC) to compare with known AVC events. Patterns of events in the AVC
are known in advance and determine the classification of the action.

Many research projects in this area use their own form of plotting space-time
data from the image sequences and calculating the closest distance to
predetermined or automatically determined events to decide what action the human is
performing [10, 12, 16, 17, 19, 23, 25, 27, 30, 31, 34, 35, 43, 50, 55, 59]. M.
Dimitrijevic et al. [17] developed a template database of actions based on five
male and three female subjects. Each human action is represented by three frames of
the 2D silhouette: the frame when the person first touches the ground with one
foot, the frame at the midstride of the step, and the end frame when the
person finishes touching the ground with the same foot. The three-frame sets
were taken from seven camera positions. To classify the action, they use a
modified chamfer distance calculation to match against the template sequences in the
database.
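
Template matching of this kind can be illustrated with a plain symmetric chamfer distance between silhouette edge points. The sketch below is only an illustration under stated assumptions: it does not reproduce the modified distance used in [17], and the point sets, labels, and function names are hypothetical.

```python
import math

def chamfer_distance(points_a, points_b):
    """Mean distance from each point in points_a to its nearest point in points_b."""
    total = 0.0
    for ax, ay in points_a:
        total += min(math.hypot(ax - bx, ay - by) for bx, by in points_b)
    return total / len(points_a)

def symmetric_chamfer(points_a, points_b):
    """Average of the two directed chamfer distances."""
    return 0.5 * (chamfer_distance(points_a, points_b)
                  + chamfer_distance(points_b, points_a))

def match_template(query, templates):
    """Return the label of the template whose edge points are closest to the query."""
    return min(templates, key=lambda label: symmetric_chamfer(query, templates[label]))

# Hypothetical edge points standing in for silhouette contours.
templates = {
    "walk": [(0, 0), (1, 2), (2, 4), (3, 2), (4, 0)],
    "stand": [(2, 0), (2, 1), (2, 2), (2, 3), (2, 4)],
}
query = [(0, 0), (1, 2), (2, 4), (3, 2), (4, 1)]
best = match_template(query, templates)
print(best)
```

A real system would compare all three key frames per action and every camera view; the brute-force nearest-point search here is quadratic and is usually replaced by a distance transform.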

D. Weiland et al. [10] use motion history volumes to determine human
gestures by extending the 2D pixel representation with time to a 3D representation
with time. This is accomplished by using multiple cameras around the person and
subtracting out any background information. Classes are created manually for
each action or gesture. Mahalanobis distance with principal component analysis
is used to identify the action from the appropriate class.

The work of A. Mokhber et al. [23] also uses volumetric models to
determine human actions. Features representative of the action sequence are
extracted from the binary images and used to compute the space-time volumes,
so that people are seen globally. The volumes are formed from all the binary
images in the action, concatenated in chronological order. These
volumes make up a feature vector of moments. Testing data is assigned to the
classes formed from the training data using Mahalanobis distances.
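
As a rough illustration of classification by Mahalanobis distance, the sketch below assigns a feature vector to the class whose training distribution is nearest. It assumes two-dimensional features so the covariance can be inverted in closed form; the class labels and sample values are hypothetical and are not taken from [10] or [23].

```python
import math

def mean_and_cov(samples):
    """Sample mean and unbiased 2x2 covariance of a list of 2D feature vectors."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis(x, mean, cov):
    """sqrt((x - mean)^T cov^-1 (x - mean)) using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

def classify(x, class_stats):
    """Pick the class with the smallest Mahalanobis distance to x."""
    return min(class_stats, key=lambda label: mahalanobis(x, *class_stats[label]))

# Hypothetical training features (for example, two moments per action sample).
training = {
    "wave": [(1.0, 5.0), (2.0, 5.0), (1.0, 6.0), (2.0, 6.0)],
    "kick": [(5.0, 1.0), (6.0, 1.0), (5.0, 2.0), (6.0, 2.0)],
}
class_stats = {label: mean_and_cov(rows) for label, rows in training.items()}
label = classify((2.0, 5.0), class_stats)
print(label)
```

With higher-dimensional moment vectors the same idea applies, but the covariance inverse would come from a linear algebra routine, and a dimensionality reduction step is often applied first so the covariance stays well conditioned.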

M. Rahman et al. [25] use an eigenspace to help categorize actions, identifying
events through distance computations. They believe that eigenspace methods should be
used since they are highly mathematical and require less image processing.
Motion from a camera is captured, manually placed into classes, and then
finally placed into a covariance matrix. This makes up the universal image set for that
action. Eigenvalues and eigenvectors are calculated, and the Karhunen-Loeve
technique is used to keep the ones that best describe the action. These make up an
orthogonal coordinate system. To recognize the behavior in an unknown image
sequence, the sequence is projected onto this coordinate system and a distance
measure is applied. This study looked at five different
cricket events with fairly good recognition.
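
The eigenspace idea can be sketched as follows: build a covariance matrix from all training vectors, keep the dominant eigenvector, and classify an unknown vector by where its projection falls. This is a simplified sketch, not the method of [25]: it keeps a single eigenvector found by power iteration, and the three-dimensional feature vectors and labels are hypothetical.

```python
import math

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows, mean):
    dim, n = len(mean), len(rows)
    centered = [[r[k] - mean[k] for k in range(dim)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]

def top_eigenvector(matrix, iterations=100):
    """Power iteration: returns a unit vector along the dominant eigenvector."""
    dim = len(matrix)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(x, mean, axis):
    """Coordinate of x along the retained axis, relative to the global mean."""
    return sum((x[k] - mean[k]) * axis[k] for k in range(len(mean)))

# Hypothetical feature vectors for two manually labeled classes.
training = {
    "bowl": [(1.0, 1.0, 0.0), (2.0, 2.0, 0.0), (1.5, 1.5, 0.1)],
    "bat": [(5.0, 5.0, 0.0), (6.0, 6.0, 0.1), (5.5, 5.5, 0.0)],
}
all_rows = [row for rows in training.values() for row in rows]
mean = mean_vector(all_rows)
axis = top_eigenvector(covariance(all_rows, mean))
class_coords = {
    label: project(mean_vector(rows), mean, axis) for label, rows in training.items()
}

def classify(x):
    """Nearest class mean in the one-dimensional eigenspace."""
    coord = project(x, mean, axis)
    return min(class_coords, key=lambda label: abs(coord - class_coords[label]))

label = classify((5.2, 4.9, 0.0))
print(label)
```

Keeping several eigenvectors instead of one would follow the same pattern, with a Euclidean distance taken over all retained coordinates.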

Blackburn and Ribeiro [27] use manifolds from isometric feature mapping
(Isomaps)  to  represent  individual  images  of  motion  sequences.  Isomaps  are  used 
to  reduce  the  dimensionality  of  the  image  but  keep  most  of  the  features  required 
for  classification.  Scores  for  the  manifolds  are  calculated  and  the  curves  are  either 
stored  (in  training)  or  compared  against  classified  events  using  Dynamic  Time 
Warping  (DTW).  Classification  is  by  nearest  neighbor. 
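
Dynamic time warping followed by a nearest neighbor decision can be sketched in a few lines. The version below assumes each stored action has already been reduced to a one-dimensional curve; the curves and labels are hypothetical, and the Isomap embedding step from [27] is not shown.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping between two 1D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def nearest_neighbor(query, labeled_curves):
    """Label of the stored curve with the smallest DTW distance to the query."""
    return min(labeled_curves, key=lambda item: dtw_distance(query, item[1]))[0]

# Hypothetical stored curves from training.
stored = [
    ("walk", [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]),
    ("jump", [0.0, 0.0, 2.0, 3.0, 2.0, 0.0, 0.0, 0.0]),
]
# A slower, longer version of the walking curve.
query = [0.0, 0.5, 1.0, 0.5, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
label = nearest_neighbor(query, stored)
print(label)
```

Because the warping path may stretch or compress time, the query does not need the same number of frames as the stored curves.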


Traditional Artificial Intelligence Techniques

Some traditional artificial and computational intelligence techniques are
used for classifying human action, many tailored to specific motions [13,
19, 20, 26, 28, 35, 40, 42, 45, 58, 63]. J.-Y. Yang et al. [19] use neural networks
to determine human actions. They reduce the errors associated with normal
human motion capture by placing tags on body parts for tracking. They also
strap a tri-axial accelerometer to the subject’s wrist to monitor motion along
three axes on that body part. Tri-axial data is captured at predetermined
time intervals. This data is the input to a neural network specifically designed
to determine whether the event is static, like standing or sitting, or dynamic,
like walking or running. Once the event is determined to be either static or
dynamic, a second neural network, either a static-event network or a
dynamic-event network, is applied to the same data and the action is classified.
The results are promising for the limited actions the system is designed to detect:
standing, sitting, walking, running, vacuuming, scrubbing, brushing teeth, and
working on a computer.

Rule-based and fuzzy systems are other common types of artificial and
computational intelligence techniques used to identify patterns, and they have been
adapted to analyze human events [13, 16, 47]. H. Stern et al. [16] created a
prototype fuzzy system for picture understanding from surveillance cameras. Their
model is split into three parts: a pre-processing module, a static object fuzzy system
module, and a dynamic temporal fuzzy system module. The static fuzzy system
module takes in the pre-processed data and outputs the number of people involved
in the scene: a single person, two people, three people, many people, or no people.
The dynamic fuzzy system determines the intent of the person, or people, based
on their global temporal movements. Although this requires only a basic
understanding of human intent, using the global movements of people and their
interactions based on global positions, it is included in many applied research
programs within the U.S. Department of Defense: Near Autonomous Unmanned
System [84], Army Research Lab Collaborative Technology Alliance [85], and
Mobile Detection Assessment and Response System [86].


Markov  Models  and  Bayesian  Networks 

Developing Markov models or Bayesian networks is a common approach to
VHAR research. This research path fits the logical approach of treating an action
as a sequence of images. Each image in the sequence is examined together with
the image that follows it, similar to how a human recognizes actions. There are
several ways to develop a network from the input data [1, 8, 14, 18, 19, 30, 36, 38,
42, 46, 48, 57]. Du, Chen, and Xu [1] use a Coupled
Hierarchical Durational-State Dynamic Bayesian Network (CHDS-DBN) to
model human actions. They claim that to understand human actions, frameworks
should capture both the motion corresponding to the interaction and the details of the
motion at different scales. For the most part, prior research did not include interaction
with other people as a determinant in understanding the intent of a single person;
most work is on the motion characteristics of the individual alone. This approach
adds a decision base to normal action identification.

The work by A. Galata et al. [8] uses variable length Markov models for
human action recognition. They claim that variable length Markov models
provide a more efficient and more flexible way to represent behaviors than
other common classification models for input data with a large temporal scale.
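
The idea behind a variable length Markov model is that the amount of history used for prediction depends on how much evidence the training data provides. The sketch below is a deliberately simplified stand-in, assuming a count threshold picks the context length; it is not the construction used in [8], and the class name, parameters, and pose labels are hypothetical.

```python
from collections import defaultdict

class VariableOrderMarkov:
    def __init__(self, max_order=2, min_count=2):
        self.max_order = max_order
        self.min_count = min_count
        # counts[context][next_symbol] -> times next_symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequence):
        for i in range(len(sequence)):
            for order in range(self.max_order + 1):
                if i - order < 0:
                    break
                context = tuple(sequence[i - order:i])
                self.counts[context][sequence[i]] += 1

    def context_for(self, history):
        """Longest suffix of history that was seen at least min_count times."""
        for order in range(min(self.max_order, len(history)), 0, -1):
            context = tuple(history[len(history) - order:])
            seen = self.counts.get(context)
            if seen is not None and sum(seen.values()) >= self.min_count:
                return context
        return ()  # fall back to the order-0 (context-free) distribution

    def predict(self, history):
        """Distribution over the next symbol given the selected context."""
        seen = self.counts.get(self.context_for(history), {})
        total = sum(seen.values())
        return {symbol: count / total for symbol, count in seen.items()}

# Hypothetical sequence of quantized body poses.
poses = ["stand", "crouch", "jump", "land", "stand", "walk", "walk",
         "stand", "crouch", "jump", "land"]
model = VariableOrderMarkov(max_order=2, min_count=2)
model.fit(poses)
print(model.predict(["stand", "crouch"]))
print(model.predict(["walk", "stand"]))
```

Frequent contexts are modeled with two symbols of history, while rare ones fall back to a shorter context, which keeps the number of stored parameters small compared with a fixed high-order model.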


Grammars 

In constructing networks, much VHAR research uses grammars [5, 22, 37, 41,
44, 50, 52, 56, 57, 58, 59, 74] to describe the sequence of movements the body
makes when determining the actions of the person visually. Grammars are
mathematically based and seem to fit well with visual action understanding due to
their network-like way of solving problems. A. Ogale et al. [22] use probabilistic
context-free grammars (PCFG) on short action sequences of a person from video.
Body poses are stored as silhouettes, which are used in the construction of the
PCFG. Pairs of frames are constructed based on their time slot: the body poses
from frames 1 and 2 are paired, the body poses from frames 2 and 3 are paired, and
so on. These pairs construct the PCFG for the given action. When testing the
algorithm, the same procedure is followed. The testing data is compared with the
trained data through Bayes’ rule: $P(s_k \mid p_i) = P(p_i \mid s_k) P(s_k) / P(p_i)$,
where $s_k$ is the kth silhouette and $p_i$ is the ith pose.
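
A small numeric example may help make the Bayes comparison concrete. The sketch below computes the posterior over silhouettes for one observed pose; the three silhouettes and all probability values are hypothetical and only illustrate the arithmetic.

```python
def posterior_over_silhouettes(likelihood, prior):
    """Bayes' rule: likelihood[s] = P(p_i | s), prior[s] = P(s); returns P(s | p_i)."""
    evidence = sum(likelihood[s] * prior[s] for s in prior)  # P(p_i)
    return {s: likelihood[s] * prior[s] / evidence for s in prior}

# Hypothetical values for three stored silhouettes and one observed pose.
prior = {"s1": 0.5, "s2": 0.3, "s3": 0.2}
likelihood = {"s1": 0.1, "s2": 0.6, "s3": 0.3}

posterior = posterior_over_silhouettes(likelihood, prior)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

Here the evidence term is obtained by summing the numerator over all silhouettes, so the posteriors sum to one.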

The work of Starner, Weaver, and Pentland [5] also uses grammars in
constructing its network. Phrase grammars were used to distinguish the type of
action from hand signals that can be networked together to form the meaning of a
sentence signed using American Sign Language. In this case, phrase grammars
limit the search set of words to improve the accuracy of what is being described.
They also speed up the process compared with not using grammars. A Hidden Markov
Model is used to train and test the data.


Traditional  Hidden  Markov  Models 

Of all the visual human action recognition networks constructed, Hidden
Markov Models (HMM) are the most widely used [3, 4, 5, 7, 8, 9, 11, 20, 21, 24,
26, 38, 45, 48, 51, 64, 80]. Hidden Markov Models keep a network of body poses
related to each other and provide a way of learning the parameters that best fit a set
of training data with known classifications. Yamato et al. [3] use HMMs to
recognize six tennis strokes, with a 25x25 mesh feature matrix describing body
positions in each frame. Campbell, Becker, and Azarbayejani [4] used HMMs to
recognize eighteen Tai Chi moves. Each move was represented by a series of
vectors formed by the 3D positions of the head and the hands. Yu and Ballard [9]
use HMMs to distinguish similar actions based on head and eye movements. The
work of Lee and Kim [7] shows how to use HMMs for gesture recognition. A
threshold model is built to provide dynamic threshold values that distinguish
between meaningful and meaningless gestures. If the action is within a
predefined threshold, it is considered a meaningful action. HMMs are used to
train and test the predefined gestures. Gehrig and Schulz [24] used HMMs to
recognize ten kitchen actions based on the movement of twenty-four points on the
upper body. They looked at skeletal data, calculated the correct movements of
people, and reduced the number of body parts to thirteen with similar results.
Gao et al. [26] used both Optical Flow Tensors (OFT) and HMMs to distinguish
basketball shot actions from video. Optical flow fields are modeled from the
video frames at several resolutions and a tensor, the OFT, is built. The
dimensionality of the data is reduced by first applying a general Tensor
Discriminant Analysis function and then a Linear Discriminant Analysis function. An
HMM is used to train and test the final features.
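
The common pattern in these systems is to train one HMM per action and label a new sequence with the model that gives it the highest likelihood. The sketch below shows only the scoring step, using the forward algorithm on discrete observations. The two models, their hand-set parameters, and the observation coding (0 for a low arm position, 1 for a high one) are hypothetical; training, for example by Baum-Welch, is not shown.

```python
def forward_likelihood(observations, start, trans, emit):
    """P(observations | model) for a discrete HMM via the forward algorithm."""
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)

def classify(observations, models):
    """Name of the model under which the observation sequence is most likely."""
    return max(models, key=lambda name: forward_likelihood(observations, *models[name]))

# Hypothetical two-state left-to-right models: (start, transition, emission).
left_to_right = [[0.7, 0.3], [0.0, 1.0]]
models = {
    "forehand": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),  # low then high
    "backhand": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),  # high then low
}
label = classify([0, 0, 1, 1], models)
print(label)
```

For long sequences the forward variables underflow, so practical implementations rescale them at each step or work with log probabilities.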


Non-Traditional  Hidden  Markov  Models 

Other  forms  of  HMMs  have  been  developed  to  handle  more  specific  problems 
associated  with  HMM  based  action  recognition  systems  [2,  6,  11,  14,  18,  20,  21, 
28, 39, 51, 54, 60, 66, 70, 76, 77, 79, 81]. Wilson and Bobick [6] use a
Parametric  Hidden  Markov  Model  (PHMM)  to  recognize  gestures.  The  PHMM 
has  an  additional  parameter  used  to  represent  meaningful  variations  of  gestures 
across  the  set  of  all  gestures.  This  gives  PHMMs  the  ability  to  distinguish 
between  gesture  meanings  with  similar  hand  movements. 

Oliver et al. [11] developed a real-time system that detects and classifies
interactions between people using a Coupled Hidden Markov Model (CHMM).
They used synthetic environments to model person-to-person interactions and thus
create their CHMM. Data from a static camera was used, and moving objects
were segmented and tracked. Data describing each person’s location, heading, and
location relative to other people were input into the synthetically created CHMM for
analysis and classification of the interaction type. Results show the CHMM
outperformed standard HMMs. This is not a far stretch, since standard HMMs
work on single automatons whereas CHMMs work on coupled automatons; thus
HMMs cannot outperform CHMMs in this particular environment.

Multi-Observation Hidden Markov Models (MOHMM) are discussed in both
[28] and [14] from Xiang and Gong. In [28] they use MOHMMs to create
breakpoints in the video content of an activity. Blobs above a certain threshold in
each frame are segmented from the pixel change history. Several functions of
these blobs are used in the feature vector to classify the video with the MOHMM.
In [14] an MOHMM was used to detect people piggybacking on someone
else’s security card at a secured, card-access-only door. Piggybacking is
when someone follows another person through a security door without using
his/her own security card to open it. The framework of the system allowed for
continual adaptation to changes in people’s movements; thus unsupervised
learning is used to continually update the model.

Gong and Xiang [18] developed a Dynamic Multi-Linked HMM (DML-HMM)
to recognize group activity from an outdoor scene. The DML-HMM is
based on salient dynamic inter-linkages among multiple temporal events using
Dynamic Probabilistic Networks (DPN). Standard HMMs cannot take into
account the multiple processes needed. The DML-HMM was designed to handle
the multitude of different object events. The topology is determined by the
causality and temporal order, and was built automatically using
factorization based on the Schwarz Bayesian Information Criterion. They claimed
that instead of being fully connected like Coupled HMMs (CHMM), the DML-HMM
aims to connect only a subset of relevant hidden state variables across multiple
temporal processes. When compared with a Multi-Observation HMM (MOHMM), a
Parallel HMM (PaHMM), and a CHMM, the DML-HMM performs better, since
the CHMM and the MOHMM propagate noise through the system and the
PaHMM discards correlations between multiple temporal processes.

Continuous HMMs (cHMMs) are used in the work of Antonakaki et al. [20].
Their work classifies abnormal behavior of people based on both their short-term
behavior and the global trajectory of each subject. A short-term behavior is a
behavior that can be classified in twenty-five frames, or one second. A one-class
support vector machine (SVM) is used to distinguish abnormal behavior in the
short-term behavior sequence. For trajectory data, a one-class cHMM is used to
determine if the person’s movement is abnormal. Both are used to determine the
final results.

Layered Hidden Markov Models (LHMMs) are used in Oliver et al. [21] to
detect specific activities in an office environment. They employ a two-level
cascade of HMMs with three processing layers. The first layer captures video,
audio, and keyboard/mouse activity to create the first-level feature vector. The
middle layer has two HMMs, one for creating an audio feature vector and one for
creating a video feature vector. The top layer uses the results of these HMMs
along with keyboard/mouse activity and the derivative of the sound localization
component as the final feature vector. The results from this top layer determine
the activity in the office. They claim the LHMM makes it feasible to decouple
different levels of analysis for training and inference. A single HMM
would need a large parameter space and thus a large amount of training data.
Also, they claim that a single HMM, unlike the LHMM, would not be robust enough
to move to a different office without retraining.

Liu and Chua [2] use an Observation Decomposed Hidden Markov Model
(ODHMM) to model and classify multiple people’s activities. They state that
automatically recognizing multiple agents (person, extremity, or object) is very
challenging due to the complexity of interactions between agents. This
complexity stems from the large dimensionality of the feature vectors and the
complex mapping of agents from input data to pre-defined activity models. To
handle this problem, they decompose each feature vector into a set of sub-feature
vectors for the ODHMM.

Del Rose et al. [70] developed the Evidence Feed Forward HMM to better
identify patterns in the observation data for identification of actions. This is a
divergence from common HMM theory, which would normally assume that
observation-to-observation linkages upset the rule of causality of the system.
However, this is not the case, since these linkages are viewed as another
probability associated with the closed system, and they offer better results than other
HMMs.


Summary of Visual Human Intent Analysis Development

Across  all  the  previous  research,  several  common  requirements  stand  out. 

-  Detect  and  segment  objects  in  each  frame 

-  Determine  relative  position  and  orientation  of  each  object 

-  Identify  a  meaningful  sequence  of  frames  from  the  visual  input 

-  Store and retrieve past sequences of behaviors for identification of current
ones.

To detect and segment out objects in each frame, there needs to be a
well-developed set of image processing tools. If the goal is to identify arm/hand
movements, as in [5], to classify American Sign Language words from visual
identification of the hands, then the image processing is very important. Any
misprocessing of the input data that goes into the classifier may cause inaccuracy.

Determining  the  relative  position  and  orientation  of  the  object  also  requires 
good  image  processing  techniques.  In  some  instances,  just  the  movement  is 
important  and  not  the  exact  location  from  frame  to  frame,  as  in  [11].  This  case 
requires  less  analysis  of  the  processed  input  data  into  the  classifier  than,  say,  [10] 
where  body  orientation  is  important  to  match  with  preselected  human  actions 
from  several  angles. 

To identify a meaningful sequence of frames, stops should be placed before
and after each interesting area. There is a large amount of research just on finding
separations in actions. If, for example, HMMs are to be used and codebooks are
required to identify common actions, then having equal length action sequences
for comparison is important. This would require either adding several inputs to the
processed sequence or removing parts in the middle which require little
attention, like slow movements. Either way, the intelligence of the algorithm
would have to be greatly increased, which requires a lot of work to automate. In
[17], three frames that describe the action were taken from a small set.
Automating this process requires a lot of image processing to correctly identify
the starting, middle, and ending location of a person’s stride.
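
One simple way to obtain equal length sequences is to resample every sequence to a fixed number of frames. The sketch below uses uniform linear interpolation over a one-dimensional feature; this is an assumed stand-in for the adding and removing of frames described above, and it does not single out slow movements the way a more intelligent selection would.

```python
def resample(sequence, target_length):
    """Linearly interpolate a 1D feature sequence to exactly target_length samples."""
    if target_length < 2 or len(sequence) < 2:
        raise ValueError("need at least two samples in and out")
    step = (len(sequence) - 1) / (target_length - 1)
    out = []
    for k in range(target_length):
        pos = k * step
        lo = min(int(pos), len(sequence) - 2)  # keep lo + 1 inside the sequence
        frac = pos - lo
        out.append(sequence[lo] * (1 - frac) + sequence[lo + 1] * frac)
    return out

stretched = resample([0.0, 2.0, 4.0], 5)                       # adds frames
shortened = resample([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 4)   # drops frames
print(stretched, shortened)
```

For multi-dimensional features the same interpolation can be applied to each component independently.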

Identification of a sequence could also mean sequences that have no visual
information passed on, as in [27]. They have identified a way of processing the
visual information to a point where only a few values represent the sequence. The
reader is cautioned that the further the data is from its original values, the more
errors are introduced into the system.

Finally, to store past behaviors, there must be knowledge of the behavior. In
intelligent systems, this is usually done through learning; however, it can also be
done through human intervention. If human intervention is used to set
up parameters for known behaviors, then patterns that would otherwise be
found through the learning process are often missed, thus causing misclassification of
actions. It is suggested that a combination of both computer learning and human
interaction be used. This requires heavy analysis of the training and testing data to
completely encompass the range of each activity being performed, a large data set
to perform several iterations of training and testing, and in-depth
knowledge of the minor facets of the action. Once the user has concluded that the
training data is complete and the baseline action sequences are stored,
retrieving them takes nothing more than a comparison of a newly processed action
to the most likely stored candidate.


Conclusion 

Much of the human brain is set aside for processing the visual sense. As
computing power continues to increase, and as an ever greater push is made for
efficiencies in business and government, letting automated computers perform
heretofore human visual tasks could lead to great efficiencies and improvements.
Visual Human Intent Analysis (VHIA) is a wide open field of research with
several different methods that relate to individual solutions, like identifying hand
gestures, understanding different tennis strokes, identifying office activities, and
many others. Each method described above plays on the solution’s strengths and
weaknesses, whether it is simplified classification with heavy pre-processing or a
more intelligent decision system. The wide range of methods demonstrates the
openness to new and innovative solutions that are tailored to one’s own problem.

However, across all the different techniques there are four common tasks which
every VHIA system must perform: detect and segment objects in each frame,
determine the relative position and orientation of each object, identify a meaningful
sequence of frames from the visual input, and store and retrieve past sequences of
behaviors for identification of current ones. These tasks are performed in
different ways depending on the type of classification system. They require a lot
of image processing, analysis of the input data, and in-depth knowledge of the
actions with respect to the processed data.